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Abstract 

We study a heteropolymer model with random contact interactions introduced some 
time ago as a simplified model for proteins. The model consists of self- avoiding walks on 
the simple cubic lattice, with contact interactions between nearest neighbor pairs. For 
each pair, the interaction energy is an independent Gaussian variable with mean value 
B and variance A 2 . For this model the annealed approximation is expected to become 
exact for low disorder, at sufficiently high dimension and in the thermodynamic limit. We 
show that corrections to the annealed approximation in the 3-d high temperature phase 
are small, but do not vanish in the thermodynamic limit, and are in good agreement with 
our replica symmetric calculations. Such corrections derive from the fact that the overlap 
between two typical chains is nonzero. We explain why previous authors had come to 
the opposite conclusion, and discuss consequences for the thermodynamics of the model. 
Numerical results were obtained by simulating chains of length N < 1400 by means of the 
recent PERM algorithm, in the coil and molten globular phases, well above the freezing 
temperature. 



1 Introduction 



Apart from their extreme biological importance, proteins are also very interesting objects from 
the point of view of statistical mechanics. They possess a very well denned native structure 
which they are able to find in a short time among a potentially huge number of competing 
ones, and in spite of many metastable states. How proteins reconcile the stability of the native 
structure with the requirement that this structure is rapidly reached constitutes the essence of 
the fascinating and still open protein folding problem J .] 

An interesting question is whether the property of folding is a generic property of ran- 
domly assembled polypeptidic chains, regardless of their biological function, or is a special 
property that has evolved through natural selection. This kind of question makes the protein 



folding problem a bridge between theoretical biology and the statistical mechanics of disor- 
dered systems. Motivated by this, numerous authors have studied simple models of random 
heteropolymers @, §, f§ §, §, 0, §, |, [TT| [XT], 0, 0, [Hj, 0, |T§, see |TJ for a review. 



In the following, we shall discuss only the 'random bond model' introduced independently 
by Garel and Orland |§ and by Shakhnovich and Gutin Q. More precisely, in our numerical 
simulations we will study a lattice version of this model. Preliminary results of this work have 
already been presented in [f20fl . A "protein" with N + 1 "amino acids" is represented as a 
self-avoiding walk of N steps on the simple cubic lattice. Each pair (i, j) of non-bonded 
monomers on nearest neighbor lattice sites contributes to the total energy an amount given 
by an independent and identically distributed (iid) Gaussian variable Bij with mean B' and 
variance A' 2 . Formally, one defines the contact map of configuration C, <j(C), as the matrix of 
binary variables G {0, 1}, with i, j G {0, • • • N}, such that 



(TiAC) 



1 if i and j are in contact and non-bonded 
otherwise. 



The energy of the model can then be written as 



E(C,{B})=J2^ J (C)B lJ , (2) 

i<j 



with B i:j = B', B% - B^ = A' 2 . For a given realization of the interaction energies Bi 



-2 

(representing a protein sequence in the biological analogy), the partition sum at temperature 
T can be formally computed as 



Z N {B lJ } = Y,z- E{C ' {B})/kBT 1 (3) 
c 

where the sum over configurations C runs over all self-avoiding iV-step walks. Obviously, the 
above expression depends only on the variables A = A'/ksT and B = B'/ksT, i.e. we have 
a two-parameter phase diagram in the variables B and A. The main advantage in using B as 
one of the independent variables instead of T or (3 = 1/ksT is that we can pass continuously 
from positive (repulsive, hydrophilic) to negative (hydrophobic) B. 

As usual with random models, we have to evaluate the quenched average of the free energy. 
This is a very difficult task, while it is rather easy to perform an annealed average over the 
disorder. For several models of random spin systems it is well known that such an annealed 
approximation becomes exact in the high temperature phase, in the thermodynamic limit and 
at sufficiently large dimension. The same is thought to be true for the present model. It was 
indeed predicted in || that the annealed approximation becomes exact in 3 dimensions when 
the chain length tends to infinity. For this to be true it is necessary that the overlap between two 
randomly chosen replicas with the same realization of disorder vanishes in the limit iV — > oo. 

Numerical tests of this prediction have been made in the past for chains of length < 36, 
mostly by means of exact enumerations of maximally compact chains of length 27 [T3, 15]. 



These authors found deviations (replica overlap is non-zero) which seemed to decrease with N. 
A similar result even for d = 2 was found in ||18|| , where also Monte Carlo simulations of very 



short chains were used (up to iV = 64). But it is clear that tests with such short chains can 
hardly be significant. In the present paper we shall present Monte Carlo simulations for chains 
of length up to iV = 1400. These simulations are made with the PERM algorithm developed 



recently by one of us P3[ , and applied successfully to a number of different polymer problems 

We study the corrections to the annealed approximation using two different approaches. 
First, we compute them using the replica method and assuming replica symmetry, which is 
believed to hold for low disorder. Even if a full computation was not possible, the expected 
behavior was well confirmed by numerical simulations. Second, we notice that corrections to the 
annealed approximation in the weak disorder limit can be related exactly to the average overlap 
between pairs of homopolymers (without any disorder). We give strong theoretical arguments 
that this overlap does not vanish in the limit N — > oo. We also calculate it by means of Monte 
Carlo simulations. Unlike in the previous case, these simulations do not involve the averaging 
over the disorder and thus can be applied to larger systems. 

The two methods agree with each other and show that the corrections to the annealed 
approximation are small in d = 3, but do not vanish in the thermodynamic limit. Deviations 
from the annealed approximation are larger in the coil (high-temperature) phase and very small 
in the collapsed (globular) phase. 

The annealed approximation is presented in sec. 2 and compared to results of Monte Carlo 
simulations. In order to explain the observed deviations, we study in sec. 3 a scenario where the 
overlap is non-zero but replica symmetry is unbroken. We again compare theoretical predictions 
with simulation results. The relationship between the weak disorder limit and homopolymer 
overlap is discussed in sec. 4. Additional thermodynamic considerations are presented in sec. 5, 
and our final conclusions are drawn in sec. 6. The PERM algorithm used for the simulations is 
discussed in an appendix. 



2 Annealed Approximation 

In thermodynamic systems with quenched disorder we have to consider the average of the free 
energy per monomer over individual realizations of disorder {-B^} which formally is given by 

F N (B,A) = -— \og{Z N {B l3 })) = -—H j dB l3 Z N {B %j \ . (4) 

As for most random systems, this cannot be evaluated in closed form. Much easier to 
evaluate is the annealed approximation 

F N>ann (B,A) = -^ \ogZ N (5) 

obtained by taking the disorder average before taking the log. Here the Gaussian integrals can 
be done explicitly, with the result 

Z N = Y,e- {B -^ 2) ^<^ {C) • (6) 
c 

Since this is the partition sum for a homopolymer with pair energy 

B = B- ^A 2 , (7) 



we see that [|] 

F NtWa (B,A) = F N {B,0) . (8) 

Therefore, all thermodynamic variables can be expressed in the annealed approximation 
in terms of an equivalent homopolymer with shifted interaction strength. This relationship 
is easiest for those observables whose definition does not involve a derivative with respect to 
temperature, such as the gyration and end-to-end radii, and the density of non-bonded nearest 
neighbor contacts c. The latter is defined as the average number of nn. contacts between 
non- consecutive monomers divided by N. For these observables, we have 

R N ^ nn (B,A) = R N (B,0) (9) 

and 

c wn (B,A) = c(B,0)=c, (10) 

precisely as in eq.(H). 

For energy U and entropy S the relations are less simple, since these involve derivatives of 
the free energy with respect to T which are changed into derivatives with respect to B and A 
by our convention of using T — 1. For the energy per monomer it holds 

B — A 2 ~ ( B — A 2 \ 

U N , aIm (B, A) = —^—U N (B, 0) = ( — g— j 5 > (11) 

where we used the fact that the energy for homopolymers is Un{B,0) = cB. For the specific 
entropy S N (B, A) = -JLf n {B, A, T)\ T=1 we use F N (B, A, T) = 
TF N (B /T, A/T, 1) together with eq.(^), and obtain 

S Niaan (B, A) = S N (B, 0) - ^U N (B, 0) . (12) 

IB 

The number of configurations with fixed c should increase as exp(Nf(c)) for large N, i.e. 
f(c) is the entropy density in the fixed- N, fixed-c ensemble. For homopolymers, the ensemble 
with fixed B becomes equivalent to the fixed-c ensemble in the limit iV —>■ oo, Thus c becomes 
a non-fluctuating function of B, c = c(B) = c, and the above formula becomes simply 

Sn^B, A) = /(c) - ^c foriV - oo , (13) 
where c(B) is the solution of the saddle point equation 

m = ^-\^=B. (M) 



The condition for thermodynamic stability is that the second derivative of / should be 
negative, corresponding to F being minimal. This is equivalent to requiring that the specific 
heat is positive. In fact, the specific heat for a homopolymer is given by 
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Figure 1: Collapse transition line. The solid line is the annealed prediction, B — A 2 / 2 = Bg = 
—0.269. Numerical data are obtained by measuring the end-to-end distance. At A > 2.5 (not 
shown) there are significant deviations from the annealed approximation, that are most likely 
due to a direct freezing from the swollen phase. 



which has been obtained by deriving both sides of Eq. (|14j) with respect to T. 

Homopolymers with attraction between unbonded nearest neighbors show a collapse ( "theta" ) 
transition where the specific heat diverges. Thus we expect that the second derivative d 2 f/dc 2 
vanishes at the theta-point c = eg (the precise value of the transition point depends on the 
lattice considered). 

The annealed approximation is supposed to be valid both above and below the theta tran- 
sition. At very low temperatures and very large disorder, it has to break down since otherwise 
the entropy would become negative, according to eq. (|l2"D . This signals another phase transition, 
the so-called freezing transition. We shall not discuss this regime in this paper, but will treat 
it in a forthcoming publication. 

Since the theta point is a tricritical point [^T|, |25| , its upper critical dimension is d=3. 
Therefore, we expect that in three dimensions the "swelling factor" is constant, 



(R 2 )/N « const 



(16) 



at the theta point, up to logarithmic corrections |30|, |31 



Here, R is any measure of the size 
of the polymer, such as the end-to-end distance or the gyration radius. We expect that this is 



still true for heteropolymers, as long as we are not yet in the frozen regime. While eq. ([16|) gives 
the most precise numerical estimate of the collapse transition (with Bg = —0.2690 ± 0.0002 



[25]), estimates with similar precision can be obtained from the convexity of the free energy 



28], and the volume dependence of the free energy in case of periodic boundary conditions [26 



The collapse line in the (B, A)-plot obtained from simulating chains of length N < 1000 is 
shown in Fig.l. Here the solid line is the annealed approximation. We see that on this scale the 
annealed approximation seems perfect. But this is not quite true. Much more precise tests can 
be performed by comparing directly both sides of eqs.(H) to (|PTD. Typical plots obtained in this 



way are shown in Figs.[| For each of these plots we used at least 300 realizations of disorder, 
and we got at least 10 3 independent configurations for each disorder realization. Similar plots 
were made also for several other values of B and A. 
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Figure 2: Free energies per monomer (top left), end-to-end swelling factors (R 2 )/N (top 
right), density of contacts c (bottom left), and absolute values of energies per monomer Un 
(bottom right) for four systems with A = 0.5, and with B = 0, —0.20, —0.269, and —0.345. In 
the two top figures \B\ increases from top to bottom, in the bottom figures it increases from 
bottom to top. Full lines are from Monte Carlo simulations, dashed lines are predictions of the 
annealed approximation. In all panels, error bars are much smaller than the thickness of the 
lines. The values B = and —0.20 are in the swollen phase, —0.345 is in the collapsed phase, 
and —0.269 is on the theta line. 

In all these plots we see small but significant deviations. These deviations are present both 
in the collapsed (globular) and in the open (coil) phase. They depend only weekly on N. 
Therefore, even with our long chains and high statistics it is not clear whether they disappear 
for N — > oo. Obviously, in order to proceed we need more refined theoretical predictions to 
compare with, and/or a more efficient way to do the disorder average. 

Before we do this, we should point out that deviations from the annealed approximation 
were also found recently in a different model by Trovato et al. W^ - 



3 Replica Symmetric Approximation 



To go beyond the annealed approximation, we will use the replica trick 



\nZ 



N 



lim 



11 



(17) 



Alternatively, we could try a Taylor expansion 

1 



HZn/Z. 



N 



(z 2 N /z 



N 



D + 



This expansion is most likely divergent. It is nevertheless useful since its first term gives already 
a good indication of the leading corrections. Also, it suggests the inequality 



F N (B,A)>F N ^(B,A) 



(19) 



which can easily be derived exactly from convexity of the logarithm. The same inequality is 
expected to hold for Un- This inequality is equivalent to the existence of a negative correlation 
between the average energy and the partition function. 

To use eq.(|i7D, we first have to evaluate disorder averages of for integer n > 2. These are 
performed similarly to the average over Zn, except that the Gaussian integrals give formally 
rise to interactions between replicas 0, 



£ ex p 



A 2 



a=l \i<j 



(20) 



\i<j 



Here the Greek indeces a and /3 refer to the different replicas, C a is a configuration of replica a, 
afj is its contact map and B is given by equation (0). The annealed approximation is equivalent 
to neglecting the two-replicas term. 
To proceed, we define the variables 



i<j 



Qa/3 



1 a 8 



(21) 



which are respectively the density of contacts for the contact map a and the overlap between 
two contact maps a and f3. The overlap is a measure of similarity, and it is equal to one if and 
only if the two contact maps coincide. Furthermore, we assume that, for large N, the number 
of configuration n-tuples with Nc±, . . . Nc n contacts and mutual overlaps {q a s} grows as 



exp 



N 



[22) 



£ f(c a ) ~ £ Xk ({c a }, {q a p}) 

\a=l k=2 / 

In other words, Xk ({ c a}, {lap}) is the entropy loss per monomer when we impose that the 
replica Ck with density of contacts c a has overlaps q^k • ■ • qk-i,k with the k — l previous replicas. 
This quantity can be measured, for instance for k = 2. 
We can then write 



Z n w / d{c a }d{q aP ] x 



£(/( 

a=l 



(23) 



x exp < N 



Bc a ) + ^- I £ ^c^q a/3 -J2xk ({c Q }, Uaa}) 

- \a^B k=2 



exp[-NnF n ({c a },{q al3 })]. 



Here, F n {{c a }, {q a p}) is the free energy per monomer in a system with n replicas. To evaluate 
it, we approximate the integrals over c Q and q a p by their saddle points. We assume replica 
symmetry which is expected to hold for low disorder: the saddle point is assumed to be given 
by c a = c for all a and q a p = q for all pairs a ^ (3. 

Now, in order to obtain the correct free energy, we have to take the limit n — > 0. We obtain 

F{B, A) = -f(c) + Bc + l - A 2 cq - X {c, q). (24) 

where x(c, q) = — lim n ^ Ylk=2 Xn(c, q)/n is the average entropy gain per replica due to the 
condition that the overlap among all replicas is equal to q. Note that this quantity is positive 
because, in the limit of vanishing n, the number of terms in the sum is -1. Finally, we have to 
compute the values of c and q at which F is evaluated by imposing two saddle point conditions: 



^ + *M = s + i AV (25) 

Ia^^M> (26) 

For A = (homopolymer) Eq.(^) just means that the value of the overlap is the most 
probable one for a given c, qo(c). Because of the normalization, it must be x( c , Qo) — 0, thus 
the free energy of the homopolymer is just a special case of Eq.(^). Moreover, since x(c, go) — 
is an absolute minimum, also the derivative dx/dc must vanish at that point, thus the saddle 
point equation for c valid for the homopolymer is recovered for A = 0. It also follows from this 
argument that the second derivatives of x(c, q) at the point (c, <?o( c )) must be non- negative. 

Notice that the free energy has to be maximized as a function of q because this variable 
refers to a space with a negative number of dimensions in the limit n — > 0. We thus get a first 
condition of thermodynamic stability d 2 x/dq 2 > 0, which, from the above consideration, is 
expected to be fulfilled for A small enough. The situation is more complicated for the variable 
c. It enters both into the free energy of the replica interactions which has to be maximized 
for n — > 0, and into the free energy of the homopolymer which has to be minimized, at least 
for A = 0. We conjecture that the corresponding condition of thermodynamic stability is that 
the Hessian determinant of the free energy respect to the variables c and q, H(c,q), be non 
positive: 

^ ( dc 2 dc 2 ) dq 2 ( dcdq c dq) ~ ^ ^ ^ 

For A = we have H(c,q) = (d 2 f /dc 2 ) (d 2 x/dq 2 ) < as for homopolymers. In fact, 
at that point the first derivatives of x(c, q) vanish. The Hessian determinant vanishes also, 
because x(c, q) stays constant at the value zero along the line q = qo(c). Thus both conditions 
of thermodynamic stability are fulfilled at A small enough. 

The energy and entropy per monomer are obtained in the same way as in the annealed 
approximation. We find 

U N (B,A) = (B-A 2 (l-q))c, (28) 
S N (B, A) = f(c) + X (c, q) - ^-c(l - q). (29) 
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Figure 3: q(B, A) measured from the energy and the contact densities as a function of system 
size for A = 0.5 and four different values of = B (panel a) and for B = —0.345 and five different 
values of A (panel b). 



Although this is obtained from the saddle point method which is exact only for N — > oo, we 
can use Eq.(28) to obtain effective values of q also for finite N. Results are shown in Fig.[3]a. 
It appears that q decreases with system size, but its asymptotic value seems to be finite in 
the random coil phase. This was confirmed by similar measurements at different values of 
A. Thus the annealed approximation does not hold in the random coil phase. The situation 
is more difficult for collapsed chains. In this case it can not be excluded from Fig.[3]b that 
the overlap asymptotically vanishes, and thus the annealed approximation becomes exact in 
the thermodynamic limit. However, simulations of systems of even larger size that will be 
presented in the next section show that this is not the case and that the corrections to the 
annealed approximation remain finite in the thermodynamic limit in both the coil and the 
collapsed phases, although in the latter phase they are rather small. As seen from Fig.Qo, q 
does not depend very much on A in the collapsed phase and for large systems. 
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Figure 4: [x(B, A) + x{B,0)}/2 measured from Eq.([3~3l) as a function of system size for the 
same values of B and A as in FigS 



To obtain numerical estimates of Xi we subtract from Eq . (|24"D the analogous equation for 



homopolymers, and obtain 



F(B, A) - F(B, 0) = (c - c)B - [/(c) - /(c)] + -A 2 cg - X (c, g). (30) 
Expanding the difference /(c) — /(c) in the difference c — c and using f'{c) = B, we get 

F(5, A) - F(5, 0) = -~(c - c) 2 f (c) + ^A 2 cg - x (c, g) + O ((c - 5) 2 ) . (31) 

This can be either evaluated directly, neglecting the last term. Alternatively, we can eliminate 
the term involving /"(c) by subtracting from Eq.(p5|) the analogous equation for homopolymers, 
which gives 

(c - c)f (c) = \a\ - + O ((c - c) 2 ) . (32) 

Combining this with Eq.([31]) and neglecting terms O ((c — c) 2 ), we obtain finally 

F(B, A) - F(B, 0) = £±£ A 2 g - i [ X (c, g) + X (c, g)] + O ((c - c) 2 ) . (33) 

In Fig.[| we show numerical values for [x(c, g) +x(c, g)]/2 obtained in this way. We see that x is 
very small but definitely not zero. Again we see that corrections to the annealed approximations 
are larger in the swollen phase than in the collapsed phase. Indeed, this time it seems that 
the deviations have a finite limit for iV — > oo in both phases. This conclusion is supported by 
measurements at other values of A (not shown here). Basically the same conclusions are also 
drawn from Eq. (|3lD , showing that x( c ; q) depends weakly on c and that Eq.(|32"D is very well 
satisfied. 



4 Overlap of Homopolymers 



In this section we will discuss the overlap of contact matrices for homopolymer chains and their 
relation to corrections to the annealed approximation in the weak disorder limit. Consider the 
derivative of the free energy of a random heteropolymer with respect to A 2 , at A = 0. Using 
Eqs. (p~8|j20|) and the results of sec. 2, we obtain 



dF N (B,A)~ 




OA 2 


A=0 



Thus we have 



dlnZ 



N 



idz 2 



N 



OA 2 
d 



2 OA 2 



J A=0 



ldF N (B,ti) 1 

ZJV CiA i<j 



OB 



2 



N 



J A=0 



\{cq). 



(34) 



(35) 



This could have been obtained of course also within the replica symmetric approach, but the 
above derivation shows that it is indeed a rigorous result involving neither approximations nor 
unjustified assumptions. 

Notice that this cannot be generalized to A ^ 0, but in principle straightforward general- 
izations could be used to compute all higher derivatives 



OA 



2k 



(Fn — -FjV,ann) 



k 



1,2,3, 



(36) 



A=0 



Numerically, the rhs. of Eq.(^) can be estimated by simulating pairs of chains simultane- 
ously. For this we used a variant of the PERM algorithm where we add monomers alternatively 
to the first and to the second chain [32" ] . In this way we guarantee that both chains have exactly 
the same length (after having added an even number of monomers), and it is straightforward 
to estimate their overlap. 

Results from such simulations with chains of length up to 1400 are shown in Fig.[5[ These 
data agree nicely with extrapolations of the overlaps for A > shown in the last section. They 
have much smaller statistical errors, since we do not have to average over any disorder explicitly. 
This makes the present method much faster and allows us to study larger systems. 




Figure 5: Average overlap go = ( c q)/ ( c ) between homopolymer chains of length N for different 
values of the monomer-monomer attraction B. 

The curve for B = —0.2 > Be in Fig.[5] shows clearly that the annealed approximation does 
not become exact for N —>■ oo in the open coil phase. The same is true (although a bit less 
clear) exactly at the point, as indicated by the curve for B = —0.269. For the collapsed 
phase, the evidence is not so clear. The curves for B = —0.35 and for B = —0.5 both are much 
lower for large N and continue to decrease. Superficially, one might therefore conclude that 
the overlap disappears for iV — > oo. But both these curves show a distinct upward curvature 
for the largest values of N, indicating that the decrease will level off and go tends to a finite 
constant for N — > oo. To sustain this view, we show in Fig.^ the plots of (eg) as a function of 
chain length N. In this case it is evident that the curves are going to a non-zero value. Since 
the fraction of contacts is limited (it holds c < 2 on the cubic lattice), also go = (eg)/ (c) should 
go to a non-zero value, and the decrease observed in Fig.|5] is just a consequence of the fact that 
the average fraction of contacts is increasing, approaching its stationary value. 
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Figure 6: Average fraction of common contacts (cq) between homopolymer chains of length N 
for different values of the monomer-monomer attraction B. 

This conclusion is not, after all, very surprising. It can be backed by an argument which 
could presumably, with some effort, be made rigorous. The density of contacts in a self- avoiding 
walk (SAW, corresponding to B = 0) is dominated by short range contacts, i.e. by contacts 
with small \i — j\. The contribution of such a loop with fixed i and j depends weakly on 
the configuration of the chain far away from this monomer pair. Thus, if such a loop is present 
in one replica, it has a high chance to be present also in the other replica, even if the global 
structures of both replicas are entirely different. 

Notice that this argument depends crucially on the fact that it is the overlap between contact 
matrices which is of relevance, not the geometric overlap between configurations. It seems that 
the authors of [0 have missed this in an otherwise very similar argument which has led them 
to the opposite conclusion that the annealed approximation does become exact. 

In collapsed chains there are relatively more long range contacts, which explains why the 
annealed approximation is better - but still not exact - in that regime. For random heteropoly- 
mers with strong disorder, and even more so for proteins, this argument suggests an increased 
overlap because of secondary structure. For instance, an alpha helix produces an array of con- 
tacts (i, i + 4), (i + 1, i + 5), • - •, where % labels the position of the amino acid along the protein 
chain. This enhances the overlap. Moreover, there is a finite probability that such contacts 
appear simultaneously even in the structures of two unrelated proteins. Thus the average over- 
lap even in a large set of unrelated protein structure appears to attain a finite limit when the 
length of the chains increases [23 ] . 



5 Thermodynamics in the replica symmetric approxima- 
tion 



The two saddle point equations for c and q can not be explicitly solved, because we lack an 
explicit expression for the functions /(c) and x( c )<?)- Nevertheless, their qualitative behavior 
can be studied in more detail. This is done in the present section. 



Deriving both equations ( p5| , |26D with respect to the thermodynamic parameters B and A 
we can compute the derivatives of c and q as 
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where H(c, q) is given by Eq. 
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Prom these, the specific heat can be computed as 



(37) 
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where F(c, q) is the free energy evaluated at the saddle point. The three terms in the square 
brackets are the quadratic form whose determinant is expressed by H(c,q). Since d 2 F/dq 2 is 
positive, they would give a negative contribution to the specific heat if H(c, q) were positive. 
Thus it is justified to require that H(c, q) is negative in order to get thermodynamic stability. 

The behavior of the thermodynamic derivaties (|3~7D can be partly studied using the fact that 
the function x( c )<?) attains its absolute minimum value x(c, q) = along the line q = qo{c). 
Thus, assuming that it is an analytic function of q for q > 0, it can be expressed in the form 
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(39) 



with a 2 (c) > 0. The typical overlap qo(c) is a small quantity and it is a decreasing function 
of c, or q' (c) < (the prime indicates derivative with respect to c). The coefficients dfc(c) are 



expected to be quantities of order (g (c))" 
the notation Q = cq, Qq(c) = cgo(c), A^{c 



-ifc+i 



as it will be argued later. We also introduce 
a>k(c)c~ k . We can now develop the saddle point 



equations for c close to the solution c of the annealed approximation given by f'(c) = B: 
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(40) 



From these expressions one sees that both Q — Qo(c) and c — c are quantities of order Qo(c), 
thus corrections to the annealed approximation are small but finite for finite A. Moreover, 
5q = (q — qo(c)) is positive for small A. From Eq.(P^) we find, to zero-th order in 5q: 
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H(c, q) must be computed at the first order in Sq, because the zero-th order term vanishes at 
the theta point c = eg at which d 2 f/dc 2 vanishes. The result reads: 

H(c, « " « " Go(c)) c(A 2 (c)) 2 ^. (43) 

To proceed further, we first consider the simple case where the entropy x(c, 9) depends only 
on the product Q = cq representing the number of conditions that we have to impose in order to 
fix an overlap q: x(c, q) = x{cq). This form was assumed in the work of Shakhnovich and Gutin 
Q. As it is easy to see, under the above hypothesis c = c(B) coincides with the value predicted 
by the annealed approximation and it is independent of A at fixed B. Thus, in this case the 
annealed approximation would yield the correct value for the fraction of contacts. However, our 
numerical results contradict this prediction. The overlap q is in this case an increasing function 
of A, and its derivative with respect to B is proportional to (d 2 f/dc 2 ) , which is expected to 
diverge at the theta point c = eg. This fact can explain why the values of q are much smaller 
in the collapsed phase than in the coil phase. 

A better theory for the function x(c, q) was developed by Plotkin, Wang and Wolynes 
[T3| . They computed the entropy loss for two replicas with density of contacts c being at 



overlap q, Xz{ c i q), in the mean field approximation and in the collapsed phase. Although 
this can be different from x(c, q), it is a good point for understanding its qualitative behavior. 



Unfortunately, we can not use the formula obtained in |13| because of several reasons: first, the 
result is valid for finite N and becomes pathologic as iV — ► 00; second, it is not given in the 
form of a differentiable function; and third, it is assumed that the most probable overlap qo(c) 
is zero in the thermodynamic limit, while our calculations show that this is not the case. We 
shall however use the fact that x(c, q) is the sum of three contributions: the entropy loss due to 
imposing that Ncq contacts have to coincide, the loss due the fact that Nc(l— q) contacts have to 
be different, and a combinatoric factor counting the number of different choices of Ncq contacts 
among Nc. The last term can be approximated by Xmix(c, q) = c [q \nq + (1 — q) ln(l — q)}, even 
if this is an overestimate, since not all of the combinations of different common contacts can 
be realized. Putting everything together we have: 



X(c, q) = x(c, cq) + c [gin q + (1 - q) ln(l - q)) . (44) 

In the computation by Plotkin et al. the mixed second derivative of x(c, cq) with respect 
to Q = cq and c vanishes. This simplifies considerably formulas, and it will be assumed to 
hold for the rest of the paper. We shall thus introduce the notation x'{Q) to denote the 
derivative of x(c, Q) with respect to Q = cq at fixed c. Comparing Eq.([0|) to Eq.(|39|), we 



see that x"{Q) m ust be positive and that dx/dc = — (1 — 2q ) log(l — q ). We also see that 
a 2 (c) = c/(g (l — %)) + x"(Q) is likely to be a quantity of order q$\ as it has been assumed 
above. For the higher order coefficients one finds a^ = O (% k+1 ^ , as anticipated. We have now 
to compute the derivatives of Qo(c) = cg (c): 

dQo d 2 x/dcdQ = % 

dc d 2 X /dQ 2 l + cq x"(cq )(l-q )- ' 1 J 

in qualitative agreement with Fig.[]. Inserting the above result in the formulas (E2]) we see 
that the density of contacts decreases with A at fixed B. This behavior is confirmed by our 
numerical results (see Fig.|8]), which also show that the decrease is maximal for B « Bg = —0.27, 
as expected from the fact that H(c, qo(c)) vanishes at c = eg. The overlap q increases with B at 
fixed A, as expected from Eq.([42"|) (see Fig.|7|), and increases with A at fixed B, as expected from 
Eq.(fE^) (see Fig.[7| again). It can thus be understood why the overlap decreases with system 
size: as the number iV of monomers increases the importance of surface effects is reduced (as 
iV -1 / 3 ) and c(N) increases, thus decreasing the value of q. 

We now examine the condition of thermodynamic stability, H(c,q) < 0. As it was already 
observed, since at the point (c, g ( c )) both the gradient of x(c, q) and its Hessian determinant 
vanish, we have H(c,q (c)) = (d 2 f/dc 2 )(d 2 x/dq 2 ) < 0. At the theta point this quantity 
vanishes, and H(cg, q) is given by the deviations from the annealed approximation. Three 
situations are possible: First, H(cg,q) can be positive at the leading order in 5q = q — qo(cg). 
In this case the thermodynamic stability would be violated around the theta point, but our 
simulations do not show anything strange in this region. Second, the leading order in Sq can be 
negative. In this case, the specific heat would not diverge anymore at the theta point for finite 
A, but it would show a peak proportional to some negative power of Sq. Thus the disorder 
would smear out the thermodynamic singularity at c = eg, leaving unchanged the geometric 
characterization of the collapsed chains in terms of the gyration radius. It is rather difficult, 
if not impossible, to test this scenario by means of simulations. Third, H(cg, q) can vanish 
identically at c = eg. It is easy to see that this condition, combined with the assumption that 
x(c, q) is of the form (fHj), is fulfilled if and only if x'(cq) is of the form 

X\cq) = log (^r^) , (46) 

where q > qo(c), A < and < B < cq are two constants, and c is not too small so that the 
last inequality can be fulfilled. In this case, one finds cgo(c) = B(c — A)/(B — A) e [0,c], and 
it is easy to check that all previous results are recovered, while H(c,q) — H(c,qo(c)) vanishes 
for all q and c, including the theta point. 

Summarizing the discussion, we find that, if x{ c i q) is °f the form (pH|), two possibilities are 
open: either x'( c <?) is given by Eq.(f46|), in which case H(c,q) = H(c,qo(c)) < 0, or x'( C( l) has 
a different form, in which case the specific heat is not anymore divergent at the theta point 
c = eg. Unfortunately, we are not able to decide among these alternatives. 

We conclude this section showing in Fig.[|, for the sake of completeness, the entropy loss 
Xi{c,q) obtained from simulations of homopolymer chains with iV = 27. Although the system 
size considered is definitely too small for quantitative considerations, the qualitative behavior 
is interesting and confirms the above considerations. The entropy x^{c, q) was computed from 
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Figure 7: Left: overlap q as a function of A 2 for different values of B. Right: entropy [x(B, A) + 
x{B, 0)]/2 measured from Eq (|33D as a function of A 2 for different values of B. 



X2(c,q) 



N 



logp 2 (c, q), 



(47) 



where P2{c,q) is the probability density for a pair of chains, both with density of contacts c, 
being at overlap q. 



6 Discussion 

We have shown that the annealed approximation is very good but not exact for a particular 
model of random heteropolymers, and we have given simple physical arguments for it. We have 
also computed the thermodynamics of the model using the replica symmetric approximation, 
and we have shown that such approach can explain very well, at least qualitatively, the observed 
deviations from the annealed approximation in the high temperature phase. The replica sym- 
metric calculation also leaves open, surprisingly, the possibility that the disorder could cancel 
the thermodynamic singularity at the theta point. A numerical test of this possibility is very 
difficult, and it has been left out. 

In our present study we have not addressed the most interesting aspect of the model - the 
freezing of the system in a finite number of mesoscopic states. This transition should represent 
some features of the folding transition taking place for protein structures. Instead, we have 
studied the model at higher temperatures and at smaller disorder. This should however be 
of interest also in the context of the freezing, since it was conjectured [Q that freezing can 
be described in this model by the random energy model, a prerequisite for which is that the 
annealed approximation is exact. 

In the present simulations we have studied chains of length up to iV = 1400. Deviations 
from the annealed approximation decrease fast for small N, which explains why studying very 
short chains has mislead several authors to the conclusion that these deviations vanish for 
iV — > oo. But in the high-T (open coil) phase this decrease clearly stops, and deviations are 
roughly independent of N for N > 100. This is less clear in the collapsed phase. But also there 
numerics, general arguments, and detailed calculations within a specific scenario with unbroken 
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Figure 8: Corrections to the density of contacts predicted by the annealed approximation as a 
function of A, for different values of B and N = 800. The deviations are maximal close to the 
theta point B « -0.27. 

replica symmetry all indicate that these deviations will settle at a non-zero value for large N. 
This casts doubts on the validity of the random energy picture for protein folding. 

There are of course a number of questions which are left open by the present study. First 
of all, we have deliberately left out all questions related to freezing. Secondly, our treatment 
of sec. 3 assumed that the overlap distribution is always dominated by a single peak. This 
is most likely not true in the frozen regime, and the distribution of overlaps is certainly a 
most interesting object. We shall address these questions in a forthcoming paper P5| . Finally, 
we have studied only one particular model, where contact energies are independent Gaussian 
variables. Several other models of random heteropolymers are studied in the recent literature 



| I9| , and several of them present very intersting open problems. 



We are indebted to many colleges for useful discussions, in particular to H. Frauenkron, W. 
Nadler, H. Orland, E. Shakhnovich, A. Trovato and M. Vendruscolo. 

Appendix: The Pruned Enriched Rosenbluth Method 



This method which was first described in detail in p5[ is a chain growth method. It is based on 



Rosenbluth's idea of biased sampling |33|, but it deviates from it by deleting ('pruning') config- 
urations with too low weight, and copying configurations with too large weight ('enrichment'). 
We remind the reader that the bias in the Rosenbluth Method requires each configuration to 
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Figure 9: Entropy loss of a pair of chains with the same value of c at overlap q, computed as 
the logarithm of the probability density of the overlap. Chains have N = 27 monomers. 



carry a non-trivial weight which builds up gradually as monomer by monomer is added. In 
addition to this 'Rosenbluth factor', there is also a Boltzmann factor in the case of interacting 
polymers. The weights which control pruning/enrichment are the product of the two. 

Both pruning and enrichment are done while the chains grow, i.e. based on the actual 
(incomplete) weight factors. In some cases it is advantageous to use not the present weights 
but (implicit) estimates of future weights to control the algorithm [31], but this is not done in 
the present paper. Instead, we use some of the special tricks used for strongly collapsed systems 
27j 



in 



When the weight is too low, configurations are not simply killed (this would imply systematic 
errors), but instead they are killed with probability 1/2, and those which are not killed get their 
weights doubled. Similarly, in the case of cloning the weight is spread uniformly among all 
clones. Technically, cloning is performed by means of recursive subroutine calls. A pseudocode 



of the basic algorithm is given in the appendix of |25| . 

The most important shortcoming of the Rosenbluth Method is that the distribution of 
weights can become extremely wide for large systems and at low temperatures. The only 
exception is for interacting homopolymers on the simple cubic lattice at the theta point, where 
Rosenbluth and Boltzmann factors nearly cancel. In less favorable cases even a very large 
statistical sample can be dominated by just a handful of high weight events, and statistical 
errors grow out of bounds. Even worse, in extreme cases the events which would carry (in 
average!) most of the weight are so rare that they are missed completely with high probability, 
and the free energy is underestimated systematically. 

Pruning and enrichment guarantee that the weights of individual configurations stay within 
narrow bounds, and the above cannot happen. But in very difficult situations (large N, low T, 



large disorder) it may happen that due to cloning the configurations are strongly correlated, 
and the weights of clusters of such correlated configurations play essentially a similar role as the 
weights of individual configurations in the above discussion. In the following we will call such a 
cluster which is composed of all configurations having a common root a 'tour'. In order to check 
that the weights of tours do not become too uneven, we have measured in their distribution. Let 
us call the weights W, and the distribution P(W). We are on safe grounds if this distribution 
is so narrow that P(W) and WP(W) have basically the same support. In particular, we should 
require that the maximum of WP(W) occurs at such values where P(W) is still appreciable, 
and the distribution is well sampled. We have verified that this is the case for all data shown in 
the present paper. Notice, however, that this is a very stringent requirement. If it is were not 
satisfied, this would not necessarily mean that the data are wrong, since configurations within 
one tour are only partially correlated. 

The generalization of this algorithm to pairs of chains growing simultaneously is straightfor- 



ward [32]. One just has to add monomers alternatingly. When the total number of monomers 
in both chains is even, the addition of the next monomer is attempted at chain 1; when this 
number is odd, the next monomer is added to chain 2. 
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